Hate Speech Detection and Classification System in Amharic Text with Deep Learning

Date

2021-08-19

Publisher

Addis Ababa University

Abstract

Social media is becoming the main source of information, allowing users to share their views freely and widely. However, the unregulated nature of this access is turning social media platforms into a breeding ground for hate speech and fake news. Online hate speech can translate into offline harm that goes beyond its psychological effects on victims. In multinational, multireligious, and less democratic countries like Ethiopia in particular, hate speech has drastic consequences because it triggers or ignites conflicts. Detecting hate speech in well-resourced languages like English is improving thanks to the availability of trained models and sufficient moderators. In Ethiopia, apart from the recently enacted hate speech proclamation, there are no automated hate speech detection mechanisms for the local languages, including Amharic, the official working language of Ethiopia. One of the measures taken to reduce the effect of hate speech in Ethiopia has been to shut down Internet connections, which has happened several times in the past. A hate speech detection system for Amharic would help in several respects. 1) It helps policymakers and peacemakers to automatically detect and act on hate speech comments circulating on the Internet. 2) It helps social media platform owners such as Facebook and Twitter to automatically flag hate speech comments before they reach larger audiences. Although hate speech is a global issue, systems developed for English or other languages cannot be applied directly to detect hate speech in Amharic, so a new home-grown solution is needed. Taking this into consideration, we developed a system that can detect and classify text into four categories.
The system is developed using Stacked Bidirectional Long Short-Term Memory networks (SBi-LSTM), a deep learning method. It is compared against two baseline detection systems of our own, one built with dummy classifiers and one with classical machine learning approaches. The deep learning system achieved higher accuracy than both baselines, reaching a promising F1-score of 94.8% with fastText word embeddings for vector representation. To develop the system, we collected a corpus of 5,000 Amharic texts and annotated it into racial, religious, gender, and normal speech categories using our own custom annotation tool and 100 annotators. Our system enables multi-label categorical classification of hate speech, which gives any responsible organization the statistical information it needs to focus on the vulnerable group, society, or religion. Developing a hate speech classification system for Amharic is challenging because no annotated dataset was available, Amharic is morphologically rich, and no prior Amharic hate speech classification study existed to serve as a baseline. The system can be improved with a larger dataset and by adding further layers on top of the SBi-LSTM layer.
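
The comparison against a dummy-classifier baseline can be illustrated with a minimal sketch. It is not the thesis code: the toy labels, the function names, the choice of a majority-class strategy, and the use of macro-averaged F1 are all assumptions made for illustration, and the sketch assumes each text carries exactly one of the four category labels, which the abstract does not confirm.

```python
from collections import Counter

# The four categories named in the abstract (label spellings are assumed).
CATEGORIES = ["racial", "religion", "gender", "normal"]


def majority_baseline(train_labels, n):
    """Dummy classifier: predict the most frequent training label n times."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return [most_common_label] * n


def per_class_f1(y_true, y_pred, label):
    """F1 for one category, treating that category as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        # No true positives: precision or recall is zero, so F1 is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels=CATEGORIES):
    """Unweighted mean of the per-category F1 scores."""
    return sum(per_class_f1(y_true, y_pred, lab) for lab in labels) / len(labels)


# Hypothetical toy labels, not drawn from the thesis dataset.
train = ["normal", "normal", "racial", "religion", "normal", "gender"]
test_true = ["normal", "racial", "normal", "gender"]

test_pred = majority_baseline(train, len(test_true))
score = macro_f1(test_true, test_pred)
print(f"majority-class baseline macro F1: {score:.3f}")
```

A baseline of this kind scores well only on the dominant category, so a trained model such as the SBi-LSTM should be judged by how far it improves on this floor.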

Keywords

Amharic Hate Speech Detection and Classification, Amharic Post and Comment Dataset, Deep Learning, Machine Learning, RNN, Stacked Bidirectional LSTM
