School of Information Technology and Engineering

Recent Submissions

  • Item
    Reinforcement Learning Based Layer Skipping Vision Transformer for Efficient Inference
    (Addis Ababa University, 2023-05) Amanuel Negash; Sammy Assefa (PhD)
    Recent advancements in language and vision tasks owe their success largely to the Transformer architecture. However, the computational requirements of these models have limited their applicability in resource-constrained environments. To address this issue, various techniques, such as weight pruning, have proven effective in reducing the deployment cost of such models. Additionally, methods tailored specifically to transformers, such as linear self-attention and token early exiting, have shown promise in making transformers more cost-effective. Nevertheless, these techniques often come with drawbacks such as decreased performance or additional training costs. This thesis proposes a layer-skipping dynamic vision transformer (ViT) network that skips layers depending on the given input, based on decisions made by a reinforcement learning (RL) agent. To the best of our knowledge, this work is the first to introduce such a model that not only significantly reduces the computational demands of transformers but also improves performance. The proposed technique is extensively tested on various model sizes and three standard benchmarking datasets: CIFAR-10, CIFAR-100, and Tiny-ImageNet. First, we show that the dynamic models improve performance compared to their state-of-the-art static counterparts. Second, we show that, in comparison to these static models, they achieve an average inference speed boost of 53% across different model sizes, datasets, and batch sizes. Similarly, the technique lowers working-space memory consumption by 53%, enabling larger inputs to be processed at a time without imposing an accuracy-speed trade-off. In addition, these models achieve very high accuracy when tested in transfer learning scenarios. We then show that, although these models have high accuracy, they can be optimized further through post-training with a genetic algorithm (NSGA-II). As such, we propose the joint RL-NSGA-II optimization technique, in which the GA is aware of the dynamics of skipping through the RL reward. These optimized models achieve competitive performance compared to the already high-performing dynamic models while reducing the number of layers by 33%. In real-world applications, the technique translates to an average of 53% faster throughput, reduced power consumption, or lower computing costs without loss of accuracy.
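The idea of input-dependent layer skipping described in this abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the names `forward_with_skipping`, `make_layer`, and `toy_policy` are hypothetical, the layers are toy scaling functions rather than transformer blocks, and the hand-written threshold rule merely stands in for the trained RL agent. The sketch assumes each layer sits on a residual path, so a skipped layer reduces to the identity.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]
Policy = Callable[[Vector, int], bool]


def forward_with_skipping(
    x: Vector, layers: List[Layer], policy: Policy
) -> Tuple[Vector, int]:
    """Run a residual stack, executing only the layers the policy selects.

    Returns the final activation and the number of layers actually executed.
    """
    executed = 0
    for index, layer in enumerate(layers):
        if policy(x, index):
            # Residual update: x <- x + layer(x)
            update = layer(x)
            x = [a + b for a, b in zip(x, update)]
            executed += 1
        # A skipped layer leaves x unchanged (identity on the residual path).
    return x, executed


def make_layer(scale: float) -> Layer:
    """Hypothetical toy layer: scales its input (stand-in for a ViT block)."""
    def layer(v: Vector) -> Vector:
        return [scale * a for a in v]
    return layer


def toy_policy(x: Vector, index: int) -> bool:
    """Hypothetical stand-in for the RL agent: always run even-indexed layers,
    and run odd-indexed layers only when the input has a large L1 norm."""
    return index % 2 == 0 or sum(abs(a) for a in x) > 10.0


layers = [make_layer(0.5) for _ in range(4)]
small_out, small_executed = forward_with_skipping([1.0, 1.0], layers, toy_policy)
large_out, large_executed = forward_with_skipping([10.0, 10.0], layers, toy_policy)
```

With this toy policy, the small input runs only two of the four layers while the large input runs all four, which is the kind of per-input compute saving the abstract refers to.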
  • Item
    Improving Knowledge Distillation For Smaller Networks Via Reducing Regularization
    (Addis Ababa University, 2023-05) Mubarek Mohammed; Beakal Gizachew (PhD)
    Knowledge Distillation (KD) is one of numerous model compression methods that reduce model size to address the problems that come with large models. In KD, a larger model, termed the teacher, transfers its knowledge, referred to as Dark Knowledge (DK), to a smaller network, usually termed the student. The key part of the mechanism is a distillation loss added to the training loss, which plays a dual role: it acts as a regularizer and as a carrier of the categorical information transferred from the teacher to the student, which is sometimes termed DK [1]. Conventional KD is known not to produce high compression rates. Existing works focus on improving the general mechanism of KD and neglect the strong regularization entangled with the DK. The impact of reducing this regularization effect has remained unexplored. This research proposes a novel approach, termed Dark Knowledge Pruning (DKP), that lowers this regularization effect through a newly added term in the distillation loss. Experiments across representative benchmark datasets and models demonstrate the effectiveness of the proposed mechanism. We find that it can improve the performance of a student over baseline KD even under extreme compression, a setting normally considered poorly suited to KD. A performance gain of 3% over baseline KD is achieved with a less regularized network on the CIFAR-10 dataset with ResNet teacher and student models. It also improves the currently reported result for the smallest network, ResNet-8, on the CIFAR-100 dataset from 61.82% to 62.4%. To the best of our knowledge, we are also the first to study the effect of reducing the regularizing nature of the distillation loss in KD when distilling into very small students. Beyond bridging pruning and KD in an entirely new way, the proposed approach improves the understanding of knowledge transfer, helps achieve better performance from very small students via KD, and poses questions for further research in model efficiency and knowledge transfer. Furthermore, it is model agnostic, shows interesting properties, and can potentially be extended to other research directions such as quantifying DK.
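For readers unfamiliar with the baseline this abstract builds on, the following is a minimal sketch of the conventional KD objective (hard-label cross-entropy combined with a temperature-softened KL term between teacher and student). It is an assumption that the thesis uses this common formulation as its baseline; the DKP term itself is not specified in the abstract and is not reproduced here. The function names and the default `temperature` and `alpha` values are illustrative only.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over logits softened by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    label: int,
    temperature: float = 4.0,  # illustrative default, not from the thesis
    alpha: float = 0.9,        # illustrative default, not from the thesis
) -> float:
    """Conventional KD loss for a single example.

    loss = (1 - alpha) * CE(student, label)
           + alpha * T^2 * KL(teacher_T || student_T)
    """
    # Hard-label cross-entropy at temperature 1.
    student_probs = softmax(student_logits)
    cross_entropy = -math.log(student_probs[label])

    # Soft-target term: KL divergence between softened distributions.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(teacher_soft, student_soft)
        if pt > 0.0
    )

    # T^2 keeps the soft-term gradient scale comparable across temperatures.
    return (1.0 - alpha) * cross_entropy + alpha * temperature ** 2 * kl


matched = kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], label=2)
mismatched = kd_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0], label=2)
```

The soft-target term is where both roles named in the abstract live: it carries the teacher's class-similarity information and also pulls the student toward a smoothed distribution. A student that agrees with the teacher incurs a lower loss than one that disagrees.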