Improving Knowledge Distillation For Smaller Networks Via Reducing Regularization
Date
2023-05
Authors
Publisher
Addis Ababa University
Abstract
Knowledge Distillation (KD) is a model compression method that reduces model size to address the problems that come with large models. In KD, a larger model, termed the teacher, transfers its knowledge, referred to as Dark Knowledge (DK), to a smaller network, usually termed the student. The key part of the mechanism is a Distillation Loss added to the training objective that plays a dual role: it acts as a regularizer and as a carrier of the categorical information transferred from the teacher to the student, which is sometimes termed DK [1]. It is known that conventional KD does not produce high compression rates. Existing works focus on improving the general mechanism of KD and neglect the strong regularization entangled with the DK, and the impact of reducing this regularization effect has remained unexplored. This research proposes a novel approach, which we term Dark Knowledge Pruning (DKP), that lowers this regularization effect through a new term added to the Distillation Loss. Experiments across benchmark datasets and representative models demonstrate the effectiveness of the proposed mechanism. We find that it can improve the performance of a student over baseline KD even under extreme compression, a regime normally considered ill-suited for KD. A 3% performance gain over baseline KD is achieved with the less regularized network on the CIFAR-10 dataset using ResNet teacher and student models. It also improves the currently reported result for ResNet-8, the smallest ResNet, on the CIFAR-100 dataset from 61.82% to 62.4%. To the best of our knowledge, we are also the first to study the effect of reducing the regularizing nature of the distillation loss in KD when distilling into very small students. Beyond bridging pruning and KD in an entirely new way, the proposed approach improves the understanding of knowledge transfer, helps extract better performance from very small students via KD, and poses questions for further research in model efficiency and knowledge transfer. Furthermore, it is model-agnostic, exhibits interesting properties, and can potentially be extended to other research directions such as quantifying DK.
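For context, the abstract describes the baseline KD mechanism but does not specify the exact form of the proposed DKP term. The following is a minimal sketch of the conventional KD loss (hard-label cross-entropy plus temperature-softened KL divergence), assuming PyTorch; the function name kd_loss and the default temperature and weighting values are illustrative assumptions, not taken from the thesis.

import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Hard-label term: standard supervised cross-entropy on the student's logits.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between temperature-softened teacher and
    # student distributions. This is the carrier of the Dark Knowledge and is
    # also the source of the regularization effect discussed in the abstract.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # T^2 scaling keeps soft-target gradients comparable in magnitude
    return alpha * kl + (1.0 - alpha) * ce

The proposed DKP approach would modify this objective with an additional term that reduces the regularizing effect of the soft-label component; that term is not reproduced here because its form is not given in the abstract.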
Keywords
Deep Learning, Neural Networks, Model Compression