Cross-Lingual Linking Framework of Wikipedia Articles
Date
2020-07-07
Publisher
Addis Ababa University
Abstract
These days, unstructured textual data on the web are growing rapidly and can be found in many languages. Such data can be structured by applying text analysis with machine learning methods. The Semantic Web uses the resulting structured information to facilitate information extraction on the web. Machine learning methods provide a different way of querying web data using statistical techniques. Among existing machine learning methods, Latent Dirichlet Allocation (LDA) is an effective and widely used method for analyzing unstructured data and performing topic allocation. However, LDA is not necessarily efficient on short documents. Documents with fewer than 50 words are common in tweets and in Amharic Wikipedia articles. The LDA model also depends on language-specific stop-words and on a stemming algorithm to find the optimal topics. In this thesis, we study a machine learning approach that links each Amharic Wikipedia article with its corresponding English Wikipedia article.
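As a minimal illustration of the short-document issue, one might filter out articles below the 50-word threshold before topic modeling. The tokenizer and threshold here are simplifying assumptions, not the thesis's exact preprocessing; real Amharic text would need a language-aware tokenizer, stop-word list, and stemmer.

```python
import re

def tokenize(text):
    # Naive Unicode word tokenizer; a real pipeline would use
    # language-specific tokenization and stop-word removal.
    return re.findall(r"\w+", text.lower(), flags=re.UNICODE)

def filter_short_documents(docs, min_words=50):
    # Keep only documents long enough for LDA to estimate stable
    # topic distributions; min_words follows the 50-word threshold
    # mentioned above.
    return [d for d in docs if len(tokenize(d)) >= min_words]
```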
In this thesis, we designed a model based on LDA. We evaluate our model by finding textual similarities between Amharic and English documents. Our model uses a stability metric to find the topics in multilingual texts. We also propose a method to find conceptual similarity and aggregate it with the textual similarity result. We use the aggregated result to link Amharic Wikipedia articles with English Wikipedia articles, forming a bilingual corpus. This improves Amharic information retrieval on the web. We report the accuracy of our model in terms of precision and recall for aligning Amharic Wikipedia with English Wikipedia. We also show that our topic stability metrics relate to the contents of the topics through proximity and topic coherence measures.
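The aggregation step can be sketched as follows, assuming the textual score is the cosine similarity of the two articles' LDA topic distributions and the conceptual score is the cosine similarity of averaged word-embedding vectors. The weight `alpha` is a hypothetical tuning parameter, not a value taken from the thesis.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def aggregated_similarity(topic_am, topic_en, emb_am, emb_en, alpha=0.5):
    # Weighted combination of textual (topic-based) and conceptual
    # (embedding-based) similarity; alpha is an illustrative assumption.
    textual = cosine(topic_am, topic_en)
    conceptual = cosine(emb_am, emb_en)
    return alpha * textual + (1 - alpha) * conceptual
```

An Amharic article would then be linked to the English article maximizing this aggregated score.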
We analysed large volumes of Wikipedia text using LDA and observed incorrect topic detection on Amharic Wikipedia articles. Our model, which is based on Wikipedia inter-language links and a word embedding model, improved on the results of the LDA model. Our approach remains stable for Wikipedia article collections of unbalanced size. Our experimental results also demonstrate that the proposed model improves the LDA-based bilingual document similarity score, provides better overall accuracy, and significantly outperforms previous studies.
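The precision and recall used to evaluate the alignment can be computed over sets of predicted and gold-standard link pairs; the pair representation below is illustrative, not the thesis's data format.

```python
def precision_recall(predicted, gold):
    # predicted, gold: sets of (amharic_article_id, english_article_id)
    # link pairs. A true positive is a predicted pair that also
    # appears in the gold standard.
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall
```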
Keywords
Semantic Similarity, Topic Models, LDA, Word Embeddings, Document similarity, Cross-lingual Linking