Unsupervised Text Document Clustering Using Encyclopedic Knowledge With Word Embedding


Addis Ababa University


Digital technologies have made it very easy and cheap to generate, store, and publish different kinds of data. Thus, in almost every discipline, people use automated systems that generate information represented as text in different natural languages. As a result, there is growing interest in better solutions for finding, organizing, and analyzing these text documents. Effective ways of rearranging huge volumes of text documents make later processing, navigation, and browsing less complicated, friendlier, and more efficient. Text document clustering is one of the common methods of organizing text documents. In recent years, Encyclopedic Knowledge (EK) has been used in different data mining tasks, including text document clustering. Moreover, with recent advances in machine learning, word embedding has become a modern feature learning technique for natural language documents, built on the idea that the semantics of a word arise simply from its context. Previous works on text clustering do not consider the advantages of using EK with word embedding. To improve the performance of text document clustering, this study proposes a system that clusters text documents using EK with neural word embedding. EK enables the representation of different related concepts, and neural word embedding is used to handle the contexts of this relatedness. During the clustering process, all text documents pass through pre-processing stages. Enriched text document features are then extracted from each document by mapping it to EK and to a trained word embedding model. Finally, the text documents are clustered using the popular spherical K-means algorithm, which is based on cosine similarity. The common evaluation measures precision, recall, and F-measure were used to assess the effectiveness of the proposed system. An Amharic text corpus and Amharic Wikipedia data were used for testing.
The study shows that using EK with word embedding for text document clustering achieves 94.95% accuracy, an average improvement of 4.32% over using only encyclopedic knowledge. Furthermore, changing the size of the class has a significant effect on accuracy: as the cluster size increases, the gap in clustering accuracy between using EK with and without word embedding widens. Finally, since we do not use any language-dependent information in the design process, our system can be applied to documents in other natural languages for which EK is available.
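
The final clustering step can be illustrated with a minimal sketch of spherical K-means. This is not the implementation used in the study; it only shows the general technique under stated assumptions: each document is already represented as a numeric feature vector (how the EK-enriched, embedding-based features are built is not specified here), and the function name `spherical_kmeans` and its parameters are hypothetical.

```python
import numpy as np


def spherical_kmeans(X, k, n_iter=50, seed=0):
    """Cluster the rows of X by cosine similarity.

    Hypothetical helper for illustration: X is assumed to be an
    (n_documents, n_features) array of document feature vectors.
    Returns (labels, centroids).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)

    # Project every document vector onto the unit sphere.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.where(norms == 0, 1.0, norms)

    # Initialise the centroids from k distinct documents.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)

    for _ in range(n_iter):
        # For unit vectors the dot product equals the cosine similarity,
        # so each document goes to its most similar centroid.
        new_labels = np.argmax(X @ centroids.T, axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels

        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue  # keep the previous centroid for an empty cluster
            total = members.sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:
                # Re-normalise so the centroid stays on the unit sphere.
                centroids[j] = total / norm

    return labels, centroids
```

The only differences from ordinary K-means are that both documents and centroids are kept at unit length and that assignment maximizes cosine similarity instead of minimizing Euclidean distance, which makes the result insensitive to document length.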



Encyclopedic Knowledge, Neural Word Embedding, Concept Based Text Clustering, Feature Enrichment