Unsupervised Text Document Clustering Using Encyclopedic Knowledge With Word Embedding

Yohannes, Dessalew

dc.contributor.advisor	Assabie, Yaregal (PhD)
dc.contributor.author	Yohannes, Dessalew
dc.date.accessioned	2019-11-14T06:16:01Z
dc.date.accessioned	2023-11-04T12:22:49Z
dc.date.available	2019-11-14T06:16:01Z
dc.date.available	2023-11-04T12:22:49Z
dc.date.issued	10/5/2018
dc.description.abstract	Digital technologies have made it very easy and cheap to generate, store and publish different kinds of data. Thus, in almost every discipline, people are using automated systems that generate information represented in text format in different natural languages. As a result, there is a growing interest in better solutions for finding, organizing and analyzing these text documents. Effective ways of rearranging the huge amount of text documents make later processing, navigating and browsing less complicated, friendlier and more efficient. Text document clustering is one of the common methods of organizing text documents. In recent years, Encyclopedic Knowledge (EK) has been used in different data mining tasks, including text document clustering. Moreover, with the recent advances in machine learning, word embedding has become a modern approach to feature learning in natural language documents, built on the idea that the semantics of a word arise simply from its context. Previous works on text clustering do not consider the advantages of using EK with word embedding. In order to improve the performance of text document clustering, this study proposes a system that clusters text documents using EK with neural word embedding. EK enables the representation of different related concepts, and neural word embedding is used to handle the contexts of this relatedness. During the clustering process, all the text documents pass through pre-processing stages. Then enriched text document features are extracted from each document through mapping with EK and a trained word embedding model. Finally, the text documents are clustered using the popular spherical K-means algorithm, which is based on cosine similarity. The common evaluation measures precision, recall and F-measure were used to measure the effectiveness of the proposed system. An Amharic text corpus and Amharic Wikipedia data were used for testing.
The study shows that the use of EK with word embedding for text document clustering results in 94.95% accuracy, an average increase of 4.32% over using only encyclopedic knowledge. Furthermore, changing the size of the class has a significant effect on the accuracy: as the cluster size increases, the gap in clustering accuracy between using EK with and without word embedding also increases. In addition, since we do not use any language-dependent information in the design process, our system can be applied to documents in other natural languages for which EK is available.	en_US
dc.identifier.uri	http://etd.aau.edu.et/handle/123456789/20115
dc.language.iso	en	en_US
dc.publisher	Addis Ababa University	en_US
dc.subject	Encyclopedic Knowledge	en_US
dc.subject	Neural Word Embedding	en_US
dc.subject	Concept Based Text Clustering	en_US
dc.subject	Feature Enrichment	en_US
dc.title	Unsupervised Text Document Clustering Using Encyclopedic Knowledge With Word Embedding	en_US
dc.type	Thesis	en_US
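
The abstract's final clustering step, spherical K-means on cosine similarity, can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the iteration count, the seed, and the toy two-dimensional vectors (standing in for the enriched document features) are all assumptions made for illustration.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product; for unit vectors this equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(vectors, k, iterations=20, seed=0):
    """Cluster vectors by cosine similarity; returns (labels, centroids)."""
    rng = random.Random(seed)
    points = [normalize(v) for v in vectors]
    # Assumption: centroids are initialized from k randomly chosen points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: pick the centroid with the highest cosine similarity.
        labels = [max(range(k), key=lambda c: dot(p, centroids[c])) for p in points]
        # Update step: sum each cluster's members and project back to unit length.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = normalize([sum(col) for col in zip(*members)])
    return labels, centroids


# Hypothetical toy "document vectors": two point along the first axis, two along the second.
docs = [[1.0, 0.1], [2.0, 0.3], [0.1, 1.0], [0.2, 3.0]]
labels, centroids = spherical_kmeans(docs, k=2)
print(labels)
```

Because every vector is normalized first, only direction matters: the two documents pointing along the first axis end up in one cluster and the other two in the second, regardless of their lengths.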

Files

Original bundle

Name: Dessalew Yohannes 2018.pdf
Size: 2.72 MB
Format: Adobe Portable Document Format

Collections

Computer Science