Cross-Lingual Linking Framework of Wikipedia Articles

dc.contributor.advisor: Atnafu, Solomon (PhD)
dc.contributor.author: Alemu, Tehetena
dc.date.accessioned: 2020-09-02T11:23:54Z
dc.date.accessioned: 2023-11-09T16:18:46Z
dc.date.available: 2020-09-02T11:23:54Z
dc.date.available: 2023-11-09T16:18:46Z
dc.date.issued: 2020-07-07
dc.description.abstract: Unstructured textual data on the web is growing rapidly and appears in many languages. Such data can be structured through text analysis with machine learning methods, and the Semantic Web uses the resulting structured information to facilitate information extraction on the web. Machine learning methods also provide an alternative way of querying web data using statistical techniques. Among these methods, Latent Dirichlet Allocation (LDA) is an effective and widely used approach for analyzing unstructured data and assigning topics. However, LDA is not necessarily efficient on short documents; documents with fewer than 50 words are common in tweets and in Amharic Wikipedia articles. The LDA model also depends on language-specific stop-words and on a stemming algorithm to find the optimal topics. In this thesis, we study a machine learning approach that links an Amharic Wikipedia article with its corresponding English Wikipedia article. We design a model based on LDA and evaluate it by measuring textual similarity between Amharic and English documents. Our model uses a stability metric to find topics in multilingual texts. We also propose a method for computing conceptual similarity and aggregate it with the textual similarity score. The aggregated result is used to link Amharic Wikipedia with English Wikipedia, yielding a bilingual corpus that improves Amharic information retrieval on the web. We report the accuracy of our model in terms of precision and recall for aligning Amharic with English Wikipedia, and we show that our topic stability metrics are related to the contents of the topics through proximity and topic coherence measures. Analyzing large amounts of Wikipedia text with LDA, we observed incorrect topic detection on Amharic Wikipedia articles; our model, which additionally uses Wikipedia inter-language links and a word embedding model, improves on the LDA result and remains stable for Wikipedia articles of unbalanced size. Our experimental results demonstrate that the proposed model improves the LDA-based bilingual document similarity score, provides better overall accuracy, and significantly outperforms previous studies. (en_US)
dc.identifier.uri: http://etd.aau.edu.et/handle/12345678/22241
dc.language.iso: en (en_US)
dc.publisher: Addis Ababa University (en_US)
dc.subject: Semantic Similarity (en_US)
dc.subject: Topic Models (en_US)
dc.subject: LDA (en_US)
dc.subject: Word Embeddings (en_US)
dc.subject: Document similarity (en_US)
dc.subject: Cross-lingual Linking (en_US)
dc.title: Cross-Lingual Linking Framework of Wikipedia Articles (en_US)
dc.type: Thesis (en_US)
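
The abstract describes two similarity signals that are aggregated to link an Amharic article with its English counterpart: a textual similarity based on LDA topic distributions and a conceptual similarity based on word embeddings. The Python sketch below illustrates that aggregation step under simplifying assumptions; it is not the thesis implementation. The token lists, the gensim model settings, the stand-in conceptual-similarity value, and the weight alpha are all illustrative and are not taken from the thesis.

# A minimal sketch: score an Amharic/English article pair by (1) cosine
# similarity of LDA topic distributions and (2) aggregating that textual
# score with a conceptual-similarity score.
# Assumptions: both articles are already tokenized and mapped into a shared
# vocabulary (e.g. via translation or the Wikipedia inter-language link).
from gensim import corpora, models
from gensim.matutils import cossim

# Hypothetical token lists for one linked article pair.
amharic_tokens = ["addis", "ababa", "capital", "city", "ethiopia"]
english_tokens = ["addis", "ababa", "is", "the", "capital", "city", "of", "ethiopia"]

docs = [amharic_tokens, english_tokens]
dictionary = corpora.Dictionary(docs)
bows = [dictionary.doc2bow(d) for d in docs]

# Fit a small LDA model over the pair; num_topics and passes are illustrative only.
lda = models.LdaModel(bows, id2word=dictionary, num_topics=2, passes=10, random_state=0)

# Textual similarity: cosine similarity between the two topic distributions.
topics_am = lda.get_document_topics(bows[0], minimum_probability=0.0)
topics_en = lda.get_document_topics(bows[1], minimum_probability=0.0)
textual_sim = cossim(topics_am, topics_en)

# Conceptual similarity: stand-in value; the thesis derives this from a
# word embedding model, which is omitted here.
conceptual_sim = 0.8

# Aggregate the two scores with a simple weighted average (alpha is illustrative).
alpha = 0.5
aggregated = alpha * textual_sim + (1 - alpha) * conceptual_sim
print(f"textual={textual_sim:.3f} conceptual={conceptual_sim:.3f} aggregated={aggregated:.3f}")

In practice the two articles would come from different language editions, so the shared-vocabulary assumption above is the main simplification; the thesis instead relies on Wikipedia inter-language links and word embeddings to bridge the languages.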

Files

Original bundle:
Tehetena Alemu 2020.pdf (1.62 MB, Adobe Portable Document Format)