Improving Knowledge Distillation For Smaller Networks Via Reducing Regularization

Date

2023-05

Publisher

Addis Ababa University

Abstract

Knowledge Distillation (KD) is one of many model compression methods that reduce model size to address the problems that come with large models. In KD, a bigger model, termed the teacher, transfers its knowledge, referred to as Dark Knowledge (DK), to a smaller network, usually termed the student. The key part of the mechanism is a Distillation Loss term added to the loss function that plays a dual role: it acts as a regularizer and as a carrier of the categorical information, the DK, that is transferred from the teacher to the student [1]. Conventional KD is known not to produce high compression rates. Existing works focus on improving the general mechanism of KD and neglect the strong regularization entangled with the DK. The impact of reducing this regularization effect has remained unexplored. This research proposes a novel approach, which we term Dark Knowledge Pruning (DKP), that lowers this regularization effect through a newly added term in the Distillation Loss. Experiments across representative benchmark datasets and models demonstrate the effectiveness of the proposed mechanism. We find that it can improve the performance of a student over baseline KD even under extreme compression, a regime normally considered ill-suited to KD. With a less regularized network, it achieves a 3% improvement in performance over baseline KD on the CIFAR-10 dataset with ResNet teacher and student models. It also improves the currently reported result for the small ResNet-8 network on the CIFAR-100 dataset from 61.82% to 62.4%. To the best of our knowledge, we are also the first to study the effect of reducing the regularizing nature of the distillation loss in KD when distilling into very small students.
Beyond bridging Pruning and KD in an entirely new way, the proposed approach improves the understanding of knowledge transfer, helps obtain better performance from very small students via KD, and raises questions for further research on model efficiency and knowledge transfer. Furthermore, it is model-agnostic, shows interesting properties, and can potentially be extended to other research directions such as quantifying DK.
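
For readers unfamiliar with the baseline that the abstract compares against, the following is a minimal sketch of a conventional KD loss for a single example, in the commonly used temperature-softened form. It is not the DKP term proposed in this work, whose form the abstract does not specify. The function names, the temperature of 4.0, the weighting `alpha` of 0.5, and the example logits are all illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """Baseline KD loss for one example (assumed form, not the thesis's DKP).

    Combines hard-label cross-entropy with a KL term between the
    temperature-softened teacher and student distributions.
    """
    # Hard-label cross-entropy on the student's unsoftened predictions
    p_student = softmax(student_logits)
    ce = -math.log(p_student[label])
    # Distillation term: KL(teacher || student) on softened outputs
    q_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(q_teacher, q_student) if t > 0)
    # The T^2 factor keeps the soft term's gradient scale comparable across temperatures
    return alpha * ce + (1.0 - alpha) * temperature ** 2 * kl

# Hypothetical logits for a three-class problem
teacher = [6.0, 2.0, 1.0]
student = [3.0, 1.5, 0.5]
loss = kd_loss(student, teacher, label=0)
```

In this form the soft term carries the teacher's relative class similarities and also pulls the student toward a smoother output distribution, which is the dual role the abstract describes. A method that reduces the regularizing part would modify this second term.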

Keywords

Deep Learning, Neural Networks, Model Compression