Hate Speech Detection and Classification System in Amharic Text with Deep Learning
Date
2021-08-19
Publisher
Addis Ababa University
Abstract
Social media is becoming the main source of information, allowing users to share
their views freely and widely. However, the unregulated nature of this access
is making social media platforms a breeding ground for hate speech and fake
news. Online hate speech can translate into offline harm beyond
its psychological effects on victims. In particular, in multi-nation, multi-religious,
and less democratic countries like Ethiopia, hate speech has drastic
consequences, triggering or igniting conflicts. Hate speech detection for resource-rich
languages like English is improving thanks to the availability of trained models and
enough moderators. In Ethiopia, apart from the recently declared hate speech
proclamation, there are no automated hate speech detection mechanisms for the local
languages, including Amharic, the official working language of Ethiopia. One of
the measures taken to reduce the effect of hate speech in Ethiopia has been to shut down
Internet connections, which has happened several times.
A hate speech detection system for Amharic would help in several
respects. 1) It helps policymakers and peacemakers to automatically detect and
act when hate speech comments are circulating on the Internet. 2) It also helps social
media platform owners such as Facebook and Twitter to automatically flag hate
speech comments before they reach larger audiences.
Although hate speech is a global issue, systems developed for English or other
languages cannot be directly applied to detect hate speech in Amharic, so a
home-grown solution is needed. With this in mind, we developed a system
that detects and classifies text into four categories. The system is built using
Stacked Bidirectional Long Short-Term Memory networks (SBi-LSTM), a type
of deep learning method. It is compared against two
baseline detection systems of our own, built using dummy classifiers and
classical machine learning approaches. The deep learning system achieved better
results than both baselines, with a promising
F1-score of 94.8% using fastText word embeddings for
vector representation. To develop the system, we collected a 5,000-item
Amharic corpus and annotated it into racial, religious, gender, and normal speech
categories using our own custom annotation tool and 100 annotators.
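
As a rough illustration of how a dummy-classifier baseline can be scored on the four categories, the sketch below uses only the Python standard library. It rests on assumptions not stated in the abstract: the dummy strategy is taken to be "always predict the most frequent training label", the F1-score is taken to be macro-averaged, and the label strings and toy data are hypothetical placeholders, not the thesis dataset.

```python
from collections import Counter

# Hypothetical label names for the four categories named in the abstract.
LABELS = ["racial", "religious", "gender", "normal"]


def majority_baseline(train_labels, n):
    """Assumed dummy strategy: predict the most frequent training label n times."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return [most_common_label] * n


def f1_for_label(y_true, y_pred, label):
    """F1-score for one label, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of the per-label F1-scores (macro averaging is an assumption)."""
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


# Toy data for illustration only; not drawn from the annotated corpus.
train = ["normal", "normal", "normal", "racial", "religious", "gender"]
y_true = ["normal", "racial", "normal", "gender", "religious", "normal"]

y_pred = majority_baseline(train, len(y_true))
score = macro_f1(y_true, y_pred)
print(f"majority-class baseline macro F1: {score:.3f}")
```

A baseline of this kind scores well only on the majority class, which is why it serves as a floor against which the classical and deep learning systems can be compared.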
Our system enables multi-label categorical classification of hate speech, which
gives responsible organizations statistical information to focus on the
vulnerable group, society, or religion. Developing a hate speech classification system
for Amharic is challenging because of the lack of an annotated dataset,
the morphological richness of Amharic, and the absence of any prior Amharic hate speech
classification study that could serve as a baseline. Our system can be improved with a larger
dataset and by adding further training layers on top of the SBi-LSTM layer.
Keywords
Amharic Hate Speech Detection and Classification, Amharic Post and Comment Dataset, Deep Learning, Machine Learning, RNN, Stacked Bidirectional LSTM