Cross-Lingual Linking Framework of Wikipedia Articles

dc.contributor.advisor: Atnafu, Solomon (PhD)
dc.contributor.author: Alemu, Tehetena
dc.date.accessioned: 2020-09-02T11:23:54Z
dc.date.accessioned: 2023-11-09T16:18:46Z
dc.date.available: 2020-09-02T11:23:54Z
dc.date.available: 2023-11-09T16:18:46Z
dc.date.issued: 2020-07-07
dc.description.abstract: Unstructured textual data on the web is growing rapidly and appears in many languages. Such data can be structured through text analysis with machine learning methods, and the Semantic Web uses the resulting structured information to facilitate information extraction on the web. Machine learning methods also provide an alternative way of querying web data using statistical techniques. Among these methods, Latent Dirichlet Allocation (LDA) is an effective and widely used approach for analyzing unstructured data and assigning topics. However, LDA is not necessarily efficient on short documents; documents with fewer than 50 words are common in tweets and in Amharic Wikipedia articles. The LDA model also depends on language-specific stop-words and on a stemming algorithm to find the optimal topics. In this thesis, we study a machine learning approach that links an Amharic Wikipedia article with its corresponding English Wikipedia article. We design a model based on LDA and evaluate it by measuring textual similarity between Amharic and English documents. Our model uses a stability metric to find topics in multilingual texts. We also propose a method for computing conceptual similarity and aggregate it with the textual similarity score. The aggregated result is used to link Amharic Wikipedia with English Wikipedia, yielding a bilingual corpus that improves Amharic information retrieval on the web. We report the accuracy of our model in terms of precision and recall for aligning Amharic with English Wikipedia, and we show that our topic stability metrics are related to the contents of the topics through proximity and topic coherence measures. Analyzing large amounts of Wikipedia text with LDA, we observed incorrect topic detection on Amharic Wikipedia articles; our model, which additionally uses Wikipedia inter-language links and a word embedding model, improves on the LDA result and remains stable for Wikipedia articles of unbalanced size. Our experimental results demonstrate that the proposed model improves the LDA-based bilingual document similarity score, provides better overall accuracy, and significantly outperforms previous studies. (en_US)
dc.identifier.uri: http://etd.aau.edu.et/handle/12345678/22241
dc.language.iso: en (en_US)
dc.publisher: Addis Ababa University (en_US)
dc.subject: Semantic Similarity (en_US)
dc.subject: Topic Models (en_US)
dc.subject: LDA (en_US)
dc.subject: Word Embeddings (en_US)
dc.subject: Document similarity (en_US)
dc.subject: Cross-lingual Linking (en_US)
dc.title: Cross-Lingual Linking Framework of Wikipedia Articles (en_US)
dc.type: Thesis (en_US)
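
The abstract describes two similarity signals that are aggregated to link an Amharic article with its English counterpart: a textual similarity based on LDA topic distributions and a conceptual similarity based on word embeddings. The Python sketch below illustrates that aggregation step under simplifying assumptions; it is not the thesis implementation. The token lists, the gensim model settings, the stand-in conceptual-similarity value, and the weight alpha are all illustrative and are not taken from the thesis.

# A minimal sketch: score an Amharic/English article pair by (1) cosine
# similarity of LDA topic distributions and (2) aggregating that textual
# score with a conceptual-similarity score.
# Assumptions: both articles are already tokenized and mapped into a shared
# vocabulary (e.g. via translation or the Wikipedia inter-language link).
from gensim import corpora, models
from gensim.matutils import cossim

# Hypothetical token lists for one linked article pair.
amharic_tokens = ["addis", "ababa", "capital", "city", "ethiopia"]
english_tokens = ["addis", "ababa", "is", "the", "capital", "city", "of", "ethiopia"]

docs = [amharic_tokens, english_tokens]
dictionary = corpora.Dictionary(docs)
bows = [dictionary.doc2bow(d) for d in docs]

# Fit a small LDA model over the pair; num_topics and passes are illustrative only.
lda = models.LdaModel(bows, id2word=dictionary, num_topics=2, passes=10, random_state=0)

# Textual similarity: cosine similarity between the two topic distributions.
topics_am = lda.get_document_topics(bows[0], minimum_probability=0.0)
topics_en = lda.get_document_topics(bows[1], minimum_probability=0.0)
textual_sim = cossim(topics_am, topics_en)

# Conceptual similarity: stand-in value; the thesis derives this from a
# word embedding model, which is omitted here.
conceptual_sim = 0.8

# Aggregate the two scores with a simple weighted average (alpha is illustrative).
alpha = 0.5
aggregated = alpha * textual_sim + (1 - alpha) * conceptual_sim
print(f"textual={textual_sim:.3f} conceptual={conceptual_sim:.3f} aggregated={aggregated:.3f}")

In practice the two articles would come from different language editions, so the shared-vocabulary assumption above is the main simplification; the thesis instead relies on Wikipedia inter-language links and word embeddings to bridge the languages.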

Files

Original bundle:
Tehetena Alemu 2020.pdf (1.62 MB, Adobe Portable Document Format)