Authorship Attribution Model for Amharic Documents using Machine Learning
Date
11/6/2020
Publisher
Addis Ababa University
Abstract
Text documents are increasingly produced anonymously through sources such as the Internet, and they circulate in many forms without their rightful owner being known. Such anonymous texts can be emails, letters, harassing messages, suicide notes or literary works, written in different languages. Identifying the true author of an anonymous text involves analyzing the writing with authorship attribution techniques. However, the form of these texts and the type and nature of the language in which they are written make the process challenging. For the Amharic language, researchers have so far addressed topic-based text classification to some extent, but style-based text classification tasks, such as authorship attribution, have not been given much consideration.
This study aims to design an Amharic authorship attribution model that can identify the authors of anonymous Amharic documents using machine learning. The architecture has two phases, a training phase and an attribution phase, built from the following components: Preprocessing, Feature Extraction, Feature Concatenation, Dimension Reduction, Classifier Training, Author Profiling and Authorship Attribution. Several types of n-gram features (word, character, part of speech, punctuation and space), different combinations of these n-grams, and n-gram-based poem-specific features are considered. The training phase extracts sets of features from the author dataset and creates an author profile that is used to train a classifier. The attribution phase extracts the same sets of features from a given anonymous test document and attributes it to one author from a set of candidate authors.
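The two phases described above can be illustrated with a minimal sketch, which is not the study's implementation. It assumes only character and word n-grams (part-of-speech, punctuation-only and poem-specific features are left out), skips dimension reduction, and replaces the support vector machine with a simple nearest-profile cosine comparison so that it runs with the Python standard library alone. All function names and the toy English training texts are hypothetical.

```python
from collections import Counter
import math


def char_ngrams(text, n):
    """Overlapping character n-grams; spaces and punctuation are kept."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Overlapping word n-grams over whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(text, char_n=3, word_n=2):
    """Feature extraction + concatenation: one relative-frequency vector."""
    counts = Counter()
    # Prefix each key so the two feature families cannot collide.
    counts.update(("c", g) for g in char_ngrams(text, char_n))
    counts.update(("w", g) for g in word_ngrams(text, word_n))
    total = sum(counts.values()) or 1
    return {key: value / total for key, value in counts.items()}


def build_profiles(labelled_docs):
    """Training phase: sum the feature vectors of each author's documents."""
    profiles = {}
    for author, text in labelled_docs:
        profiles.setdefault(author, Counter()).update(extract_features(text))
    return profiles


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(profiles, text):
    """Attribution phase: pick the candidate whose profile is most similar.

    Assumption: nearest-profile matching stands in for the SVM classifier.
    """
    features = extract_features(text)
    return max(profiles, key=lambda author: cosine(features, profiles[author]))


# Toy data (hypothetical); real use would supply preprocessed Amharic documents.
TRAINING_DOCS = [
    ("author_a", "the sun rises over the quiet hills"),
    ("author_a", "the sun sets over the quiet sea"),
    ("author_b", "markets opened sharply higher today"),
    ("author_b", "markets closed sharply lower today"),
]

profiles = build_profiles(TRAINING_DOCS)
print(attribute(profiles, "the moon rises over the quiet town"))
```

Prefixing each feature key with its family ("c" or "w") is one simple way to concatenate feature sets without collisions; a vectorized pipeline would instead stack the per-family matrices column-wise before dimension reduction and classifier training.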
A prototype of the attribution model was developed to test and evaluate its practicality. The model was evaluated, for the different n-gram features, on a dataset of more than 2,000 documents by 20 different authors and on 120+ poems by 2 poets. Using a support vector machine classifier, the model achieved an accuracy of 86.77% on dataset 1 with the combination of char 3-gram and word_plus_pos 4-gram features, and an accuracy of 0.96 on the poem dataset with the poem-specific features.
Keywords
Machine Learning, Text Classification, Stylometry, Authorship Analysis, Authorship Attribution