Unsupervised Text Document Clustering Using Encyclopedic Knowledge With Word Embedding

dc.contributor.advisor: Assabie, Yaregal (PhD)
dc.contributor.author: Yohannes, Dessalew
dc.date.accessioned: 2019-11-14T06:16:01Z
dc.date.accessioned: 2023-11-04T12:22:49Z
dc.date.available: 2019-11-14T06:16:01Z
dc.date.available: 2023-11-04T12:22:49Z
dc.date.issued: 10/5/2018
dc.description.abstract: Digital technologies have made it very easy and cheap to generate, store and publish different kinds of data. Thus, in almost every discipline, people use automated systems that generate information represented as text in different natural languages. As a result, there is growing interest in better solutions for finding, organizing and analyzing these text documents. Effective ways of rearranging the huge volume of text documents make later processing, navigation and browsing less complicated, friendlier and more efficient. Text document clustering is one of the common methods of organizing text documents. In recent years, Encyclopedic Knowledge (EK) has been used in different data mining tasks, including text document clustering. Moreover, with recent advances in machine learning, word embedding has become a modern feature learning technique for natural language documents, built on the idea that the semantics of a word arise from its context. Previous work on text clustering does not consider the advantages of using EK with word embedding. To improve the performance of text document clustering, this study proposes a system that clusters text documents using EK with neural word embedding. EK enables the representation of different related concepts, and neural word embedding is used to handle the context of this relatedness. During the clustering process, all text documents pass through pre-processing stages. Enriched text document features are then extracted from each document by mapping it to EK and a trained word embedding model. Finally, the text documents are clustered using the popular spherical K-means algorithm, which is based on cosine similarity. The common evaluation measures precision, recall and F-measure were used to measure the effectiveness of the proposed system. An Amharic text corpus and Amharic Wikipedia data were used for testing.
The study shows that using EK with word embedding for text document clustering yields 94.95% accuracy, an average increase of 4.32% over using only encyclopedic knowledge. Furthermore, changing the size of the class has a significant effect on accuracy: as the cluster size increases, the gap in clustering accuracy between using EK with and without word embedding widens. Finally, since we do not use any language-dependent information in the design process, our system can be applied to documents in other natural languages that have EK. [en_US]
dc.identifier.uri: http://etd.aau.edu.et/handle/123456789/20115
dc.language.iso: en [en_US]
dc.publisher: Addis Ababa University [en_US]
dc.subject: Encyclopedic Knowledge [en_US]
dc.subject: Neural Word Embedding [en_US]
dc.subject: Concept Based Text Clustering [en_US]
dc.subject: Feature Enrichment [en_US]
dc.title: Unsupervised Text Document Clustering Using Encyclopedic Knowledge With Word Embedding [en_US]
dc.type: Thesis [en_US]
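
Illustrative note: the abstract states that documents are clustered with spherical K-means, which is based on cosine similarity. The following is a minimal sketch of that general technique in plain Python, not the thesis's implementation. The function names (`normalize`, `dot`, `spherical_kmeans`), the random initialization, the iteration limit and the toy input vectors are all assumptions made for illustration; the enriched EK and word embedding features described in the abstract are not reproduced here.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product; for unit vectors this equals their cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(docs, k, iterations=20, seed=0):
    """Cluster document vectors by cosine similarity.

    docs: list of equal-length numeric vectors (hypothetical features).
    Returns (labels, centroids), with every centroid of unit length.
    """
    rng = random.Random(seed)
    points = [normalize(d) for d in docs]
    # Assumed initialization: pick k distinct documents as starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each document joins its most similar centroid.
        new_labels = [
            max(range(k), key=lambda c: dot(p, centroids[c])) for p in points
        ]
        # Update step: sum the members of each cluster, then re-project the
        # result onto the unit sphere. Empty clusters keep their old centroid.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = normalize([sum(col) for col in zip(*members)])
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
    return labels, centroids


# Toy usage: two documents point mostly along each axis.
toy_docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels, toy_centroids = spherical_kmeans(toy_docs, k=2)
```

Normalizing both documents and centroids is what distinguishes this from ordinary K-means: only the direction of a feature vector matters, so document length does not influence the assignment.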

Files

Original bundle
Name:
Dessalew Yohannes 2018.pdf
Size:
2.72 MB
Format:
Adobe Portable Document Format