School of Information Technology and Engineering
Browsing School of Information Technology and Engineering by Title

Enhancing Neural Machine Translation Through Incorporation of Unsupervised Language Understanding and Generation Techniques: The Case of English-Afaan Oromo Translation (2024-05)
Chala Bekabil; Fantahun Bogale (PhD)
Breaking down language barriers is a paramount pursuit in the realm of Artificial Intelligence. Machine Translation (MT), a domain within Natural Language Processing (NLP), holds the potential to bridge linguistic gaps and foster global communication. Enhancing cross-cultural communication through MT will be realized only if we succeed in developing accurate and adaptable techniques, which in turn demands adequate linguistic resources. Unfortunately, under-resourced languages face challenges due to limited linguistic resources and sparse parallel data. Previous studies tried to solve this problem by using monolingual pre-training techniques. However, such studies rely solely on either Language Understanding (LU) or Language Generation (LG) techniques, resulting in skewed translation. This study aims to enhance translation outcomes beyond the capabilities of previous studies by marrying the concepts of LU and LG, and hence boosting the quality of MT in both directions. Our proposed model, the BERT-GPT incorporated Transformer, incorporates the state-of-the-art language models BERT and GPT, trained on monolingual data, into the original Transformer model and demonstrates substantial improvements. Experimental results show that translation quality improves markedly on the test dataset: the BLEU score rises from a baseline of 35.75 to 42.09 for English to Afaan Oromo translation, and from a baseline of 40.35 to 44.51 for Afaan Oromo to English translation. Notably, our model unveils a deep understanding of Afaan Oromo’s linguistic nuances, resulting in translations that are precise, contextually appropriate, and faithful to the original intent.
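As a quick check on the reported figures, the short Python sketch below computes the absolute and relative BLEU gains from the four scores quoted in the abstract. The dictionary layout and the function name bleu_gain are illustrative choices, not part of the thesis.

```python
# BLEU scores quoted in the abstract: (baseline, proposed model).
REPORTED_BLEU = {
    "English -> Afaan Oromo": (35.75, 42.09),
    "Afaan Oromo -> English": (40.35, 44.51),
}


def bleu_gain(baseline, proposed):
    """Return (absolute gain in BLEU points, relative gain in percent)."""
    absolute = proposed - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 2)


for direction, (baseline, proposed) in REPORTED_BLEU.items():
    absolute, relative = bleu_gain(baseline, proposed)
    print(f"{direction}: +{absolute} BLEU ({relative}% over baseline)")
```

The gain is larger in the English to Afaan Oromo direction, both in absolute points and relative to the baseline.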
By leveraging unsupervised pre-training and incorporating unsupervised LU and LG techniques into the Transformer model, we pave the way for enhanced cross-cultural communication, advanced understanding, and inclusivity in our interconnected world.

Improving Knowledge Distillation For Smaller Networks Via Reducing Regularization (Addis Ababa University, 2023-05)
Mubarek Mohammed; Beakal Gizachew (PhD)
Knowledge Distillation (KD) is one of the numerous model compression methods that help reduce the size of models to address problems that come with large models. In KD, a bigger model, termed the teacher, transfers its knowledge, referred to as the Dark Knowledge (DK), to a smaller network, usually termed the student network. The key part of the mechanism is a Distillation Loss, added to the loss term, that plays a dual role: one as a regularizer and one as a carrier of the categorical information to be transferred from the teacher to the student, which is sometimes termed DK [1]. It is known that conventional KD does not produce high compression rates. Existing works focus on improving the general mechanism of KD and neglect the strong regularization entangled with the DK in the KD mechanism. The impact of reducing the regularization effect that comes entangled with DK has remained unexplored. This research proposes a novel approach, which we term Dark Knowledge Pruning (DKP), to lower this regularization effect by adding a new term to the Distillation Loss. Experiments across representative and benchmark datasets and models demonstrate the effectiveness of the proposed mechanism. We find that it can help improve the performance of a student against the baseline KD even in extreme compression, a setting normally considered poorly suited to KD. With a less regularized network, a 3% performance improvement over the baseline KD is achieved on the CIFAR-10 dataset with ResNet teacher and student models.
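For readers unfamiliar with the baseline, the sketch below shows the conventional KD objective in its commonly used form: a weighted sum of the usual cross-entropy on the true label and a temperature-softened KL term between teacher and student outputs. It does not include the DKP term, whose form is not given in this abstract. The function names and the default values of temperature and alpha are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed stably."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-probability of the true class."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.9):
    """Baseline KD loss (assumed standard form, without the DKP term)."""
    hard = cross_entropy(softmax(student_logits), label)
    soft = kl_divergence(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    # The T^2 factor keeps the soft-term gradients on a comparable scale.
    return (1 - alpha) * hard + alpha * temperature ** 2 * soft


print(kd_loss([1.0, 0.5, -0.2], [3.0, 1.0, -1.0], label=0))
```

The soft term is the part that carries the teacher's categorical information; it vanishes when the student's softened outputs match the teacher's exactly.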
It also improves the currently reported smallest result on ResNet-8 on the CIFAR-100 dataset from 61.82% to 62.4%. To the best of our knowledge, we are also the first to study the effect of reducing the regularizing nature of the distillation loss in KD when distilling into very small students. Beyond bridging Pruning and KD in an entirely new way, the proposed approach improves the understanding of knowledge transfer, helps achieve better performance out of very small students via KD, and poses questions for further research in the areas of model efficiency and knowledge transfer. Furthermore, it is model agnostic, shows interesting properties, and can potentially be extended to other interesting research such as quantifying DK.

Integrating Hierarchical Attention and Context-Aware Embedding For Improved Word Sense Disambiguation Performance Using BiLSTM Model (Addis Ababa University, 2024-06)
Robbel Habtamu; Beakal Gizachew (PhD)
Word Sense Disambiguation (WSD) is a fundamental task in natural language processing, aiming to determine the correct sense of a word based on its context. Word sense ambiguity, such as polysemy, and semantic ambiguity pose significant challenges in the task of WSD. Recent research has focused on utilizing deep contextual models to address these challenges. However, despite this progress, semantic ambiguity remains a challenge, especially when dealing with polysemous words. This research introduces a new approach that integrates hierarchical attention mechanisms and BERT embeddings to enhance WSD accuracy. Our model, incorporating both local and global attention, demonstrates significant improvements in accuracy, particularly in complex sentence structures. To the best of our knowledge, our model is the first to incorporate hierarchical attention mechanisms integrated with contextual embedding.
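The distinction between local and global attention can be illustrated with a minimal sketch. It is not the thesis architecture, which the abstract does not specify in detail: here "local" is assumed to mean dot-product attention over a fixed window around the target word, "global" means the same attention over the whole sentence, and the two context vectors are simply concatenated. The names attend and local_global_context and the window size are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(query, vectors):
    """Dot-product attention: similarity-weighted average of the vectors."""
    weights = softmax([dot(query, v) for v in vectors])
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def local_global_context(embeddings, target_index, window=2):
    """Concatenate a windowed (local) and a sentence-wide (global) context."""
    query = embeddings[target_index]
    start = max(0, target_index - window)
    stop = min(len(embeddings), target_index + window + 1)
    local = attend(query, embeddings[start:stop])   # nearby words only
    global_ = attend(query, embeddings)             # every word in the sentence
    return local + global_


# Toy 2-dimensional "embeddings" for a five-word sentence.
sentence = [[0.1, 0.3], [0.9, 0.2], [0.4, 0.8], [0.7, 0.1], [0.2, 0.5]]
print(local_global_context(sentence, target_index=2, window=1))
```

In a real system the embeddings would come from a contextual model such as BERT and the attention parameters would be learned; the sketch only shows how the two scopes differ.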
This integration enhances the model’s performance, especially when combined with the contextual model BERT as word embeddings. Through extensive experimentation, we demonstrate the effectiveness of our proposed model. Our research highlights several key points. First, we showcase the effectiveness of hierarchical attention and contextual embeddings for WSD. Second, we adapted the model to Amharic word sense disambiguation, demonstrating strong performance. Despite the lack of a standard benchmark dataset for Amharic WSD, our model achieves 92.4% accuracy on a self-prepared dataset. Third, our findings emphasize the importance of linguistic features in capturing relevant contextual information for WSD. We also note that Part-of-Speech (POS) tagging has a less significant impact on our English data, while word embeddings significantly impact model performance. Furthermore, applying local and global attention leads to better results, with local attention at the word level showing promising results. Overall, our model achieves state-of-the-art results in WSD within the same framework. Our results demonstrate a significant improvement of 1.8% to 2.9% in F1 score over baseline models. We also achieve state-of-the-art performance on the Italian language, with an improvement of 0.5% to 0.7% in F1 score over baseline papers. These findings underscore the importance of considering contextual information in WSD, paving the way for more sophisticated and context-aware natural language processing systems.

Reinforcement Learning Based Layer Skipping Vision Transformer for Efficient Inference (Addis Ababa University, 2023-05)
Amanuel Negash; Sammy Assefa (PhD)
Recent advancements in language and vision tasks owe their success largely to the Transformer architecture. However, the computational requirements of these models have limited their applicability in resource-constrained environments.
To address this issue, various techniques, such as weight pruning, have proven effective in reducing the deployment cost of such models. Additionally, methods tailored specifically to transformers, such as linear self-attention and token early exiting, have shown promise in making transformers more cost-effective. Nevertheless, these techniques often come with drawbacks such as decreased performance or additional training costs. This thesis proposes a layer-skipping dynamic vision transformer (ViT) network that skips layers depending on the given input, based on decisions made by a reinforcement learning (RL) agent. To the best of our knowledge, this work is the first to introduce such a model that not only significantly reduces the computational demands of transformers, but also improves performance. The proposed technique is extensively tested on various model sizes and three standard benchmarking datasets: CIFAR-10, CIFAR-100, and Tiny-ImageNet. First, we show that the dynamic models improve performance when compared to their state-of-the-art static counterparts. Second, we show that in comparison to these static models, they achieve an average inference speed boost of 53% across different model sizes, datasets, and batch sizes. Similarly, the technique lowers working space memory consumption by 53%, enabling larger input processing at a time without imposing an accuracy-speed trade-off. In addition, these models achieve very high accuracy when tested in transfer learning scenarios. We then show that, although these models have high accuracy, they can be optimized even more through post-training using a genetic algorithm (NSGA-II). As such, we propose the joint RL-NSGA-II optimization technique, where the GA is aware of the dynamics of skipping through the RL reward. These optimized models achieve competitive performance compared to the already high-performing dynamic models while reducing the number of layers by 33%.
In real-world applications, the technique translates to an average of 53% faster throughput, reduced power consumption, or lower computing costs without loss of accuracy.
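
The control flow of input-dependent layer skipping can be sketched in a few lines of Python. This is only an illustration of the idea described in the abstract, not the thesis implementation: the layers are toy functions rather than Transformer blocks, and skip_policy is a hand-written stand-in for the trained RL agent. All names here are hypothetical.

```python
def run_with_skipping(x, layers, skip_policy):
    """Apply layers in order, skipping those the policy rejects for this input.

    A skipped layer behaves as the identity, so its cost is never paid.
    Returns the output and the number of layers actually executed.
    """
    executed = 0
    for index, layer in enumerate(layers):
        if skip_policy(index, x):
            continue
        x = layer(x)
        executed += 1
    return x, executed


# Toy "layers": layer k adds k to every element (stand-ins for ViT blocks).
layers = [lambda v, k=k: [a + k for a in v] for k in range(1, 5)]


def never_skip(index, x):
    return False


def skip_even_layers(index, x):
    # Placeholder rule; a trained RL agent would decide from the input itself.
    return index % 2 == 0


print(run_with_skipping([0.0], layers, never_skip))
print(run_with_skipping([0.0], layers, skip_even_layers))
```

Because skipped layers are never evaluated, the saving in compute grows with the fraction of layers the policy chooses to skip for a given input.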