Coreference Resolution for Amharic Text Using Bidirectional Encoder Representation from Transformer
Date
3/4/2022
Publisher
Addis Ababa University
Abstract
Coreference resolution is the process of finding all expressions in a text that refer to the same
entity. Such expressions are called mentions. The task is to cluster all coreferent mentions in a text,
with each mention identified by the index of its words. Coreference resolution is used in several
Natural Language Processing (NLP) applications, such as machine translation, information extraction,
named entity recognition and question answering, to increase their effectiveness. In this work, we
propose coreference resolution for Amharic text using bidirectional encoder representation from
transformer (BERT). BERT is a contextual language model that generates semantic vectors
dynamically according to the context of the words.
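The clustering view of the task can be illustrated with a minimal sketch. It is not the thesis's
implementation: it assumes mentions are token-index spans and that some model has already produced
pairwise coreference links. The names `Mention` and `cluster_mentions`, and the English toy sentence,
are hypothetical.

```python
from collections import defaultdict

# A mention is a (start, end) token-index span, end inclusive (an assumed convention).
Mention = tuple[int, int]


def cluster_mentions(mentions: list[Mention],
                     links: list[tuple[Mention, Mention]]) -> list[list[Mention]]:
    """Group mentions into coreference clusters from pairwise links (union-find)."""
    parent = {m: m for m in mentions}

    def find(m: Mention) -> Mention:
        # Follow parent pointers to the cluster representative, compressing the path.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b in links:
        parent[find(a)] = find(b)  # merge the two clusters

    clusters = defaultdict(list)
    for m in mentions:
        clusters[find(m)].append(m)
    # Sort for a deterministic result: mentions by position, clusters by first mention.
    return sorted(sorted(c) for c in clusters.values())


# Toy input (English for readability): token indices run from 0 to 11.
tokens = "Abebe said he would visit Addis Ababa because the city is large".split()
mentions = [(0, 0), (2, 2), (5, 6), (8, 9)]
links = [((0, 0), (2, 2)), ((5, 6), (8, 9))]  # assumed output of a pairwise model
clusters = cluster_mentions(mentions, links)
```

Here "Abebe" and "he" end up in one cluster, and "Addis Ababa" and "the city" in another, each
identified only by word indices.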
The proposed system has a training phase and a testing phase. The training phase includes preprocessing
(cleaning, tokenization and sentence segmentation), word embedding, feature extraction, the Amharic
vocabulary, entity and mention pairs, and the coreference model. Like the training phase, the testing
phase has its own steps: preprocessing (cleaning, tokenization and sentence segmentation), coreference
resolution, and the predicted Amharic mentions. Word embedding is used in the proposed model to
represent each word as a low-dimensional vector. It is a feature learning technique for obtaining new
features across domains for coreference resolution in Amharic text. The necessary information is
extracted from the word embeddings and the processed data, as well as from Amharic characters. After
extracting the important features from the training data, we build a coreference model. In the model,
bidirectional encoder representation from transformer is used to obtain basic features from the embedding
layer by extracting information from both the left and the right context of a given word.
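The preprocessing steps named above (cleaning, sentence segmentation and tokenization) might look
roughly like the following sketch. The rules here are assumptions, not the thesis's actual pipeline:
cleaning keeps only the Unicode Ethiopic block (U+1200 to U+137F), ASCII digits, whitespace, "?" and
"!", and sentences are split on the Ethiopic full stop and question mark. The function names and the
toy sentence are hypothetical.

```python
import re

# Ethiopic sentence-final marks: full stop (U+1362) and question mark (U+1367);
# "?" and "!" are included as an assumption about modern Amharic text.
SENTENCE_END = "\u1362\u1367?!"
# Other Ethiopic punctuation: wordspace, comma, semicolon, colon, preface colon.
OTHER_PUNCT = "\u1361\u1363\u1364\u1365\u1366"


def clean(text: str) -> str:
    """Replace characters outside the kept set with spaces, then collapse whitespace."""
    text = re.sub(r"[^\u1200-\u137F0-9\s?!]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> list[str]:
    """Split after each sentence-final mark, keeping the mark with its sentence."""
    parts = re.split(f"(?<=[{re.escape(SENTENCE_END)}])\\s*", text)
    return [p for p in parts if p]


def tokenize(sentence: str) -> list[str]:
    """Separate punctuation marks from words, then split on whitespace."""
    punct = re.escape(SENTENCE_END + OTHER_PUNCT)
    spaced = re.sub(f"([{punct}])", r" \1 ", sentence)
    return spaced.split()


# Toy input with HTML residue that the cleaning step removes.
raw = "አበበ በሶ በላ። <b>እሱ</b> ደስተኛ ነው።"
cleaned = clean(raw)
sentences = split_sentences(cleaned)
tokenized = [tokenize(s) for s in sentences]
```

On the toy input this yields two sentences, and the first is tokenized into three words followed by
the Ethiopic full stop as a separate token.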
To evaluate the proposed model, we conducted experiments on an Amharic dataset prepared from various
reliable sources for this study. The evaluation metrics commonly used for the coreference resolution
task are MUC, B³, CEAF-m, CEAF-e and BLANC. Experimental results demonstrate that the proposed model
outperforms the state-of-the-art Amharic model, achieving F-measure values of 80%, 85.71%, 90.9%,
88.86% and 81.7% on these metrics, respectively, on the Amharic dataset.
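As an illustration of how one of these metrics is computed, here is a minimal sketch of the standard
mention-level B³ score. It is not the scorer used in the thesis, and it assumes that the gold (key)
and predicted (response) clusterings contain exactly the same mentions. The function name `b_cubed`
and the toy clusters are hypothetical.

```python
def b_cubed(key: list[set], response: list[set]) -> tuple[float, float, float]:
    """Return B-cubed (precision, recall, F1) for two clusterings of the same mentions."""
    key_of = {m: c for c in key for m in c}        # mention -> its gold cluster
    resp_of = {m: c for c in response for m in c}  # mention -> its predicted cluster
    mentions = key_of.keys() & resp_of.keys()      # assumed identical on both sides

    # Per-mention precision: share of the predicted cluster that is truly coreferent.
    precision = sum(len(key_of[m] & resp_of[m]) / len(resp_of[m]) for m in mentions) / len(mentions)
    # Per-mention recall: share of the gold cluster that was recovered.
    recall = sum(len(key_of[m] & resp_of[m]) / len(key_of[m]) for m in mentions) / len(mentions)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: one mention ("c") is placed in the wrong cluster.
key = [{"a", "b", "c"}, {"d", "e"}]
response = [{"a", "b"}, {"c", "d", "e"}]
p, r, f = b_cubed(key, response)
```

For the toy clusters, precision, recall and F1 all equal 11/15 (about 0.733); a perfect clustering
scores 1.0 on all three.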
Keywords
Amharic Coreference Resolution, Mention, Bidirectional Encoder Representation from Transformer, Transformer, NLP, Coreference, Word Embedding