Reinforcement Learning Based Layer Skipping Vision Transformer for Efficient Inference
Date
2023-05
Authors
Publisher
Addis Ababa University
Abstract
Recent advances in language and vision tasks owe much of their success to the Transformer architecture. However, the computational requirements of these models limit their applicability in resource-constrained environments. To address this issue, techniques such as weight pruning have proven effective in reducing the deployment cost of such models. In addition, methods tailored specifically to Transformers, such as linear self-attention and token early exiting, have shown promise in making them more cost-effective. Nevertheless, these techniques often come with drawbacks such as reduced accuracy or additional training cost. This thesis proposes a layer-skipping dynamic vision transformer (ViT) that skips layers on a per-input basis according to decisions made by a reinforcement learning (RL) agent. To the best of our knowledge, this work is the first to introduce such a model, which not only significantly reduces the computational demands of Transformers but also improves performance. The proposed technique is extensively evaluated on several model sizes and three standard benchmark datasets: CIFAR-10, CIFAR-100, and Tiny-ImageNet. First, we show that the dynamic models outperform their state-of-the-art static counterparts. Second, we show that, compared to these static models, they achieve an average inference speed-up of 53% across model sizes, datasets, and batch sizes. Similarly, the technique reduces working-memory consumption by 53%, allowing larger inputs to be processed at once without imposing an accuracy-speed trade-off. In addition, these models achieve very high accuracy in transfer learning scenarios. We then show that, although these models already achieve high accuracy, they can be further optimized after training using a genetic algorithm (NSGA-II). To this end, we propose a joint RL-NSGA-II optimization technique in which the genetic algorithm is made aware of the skipping dynamics through the RL reward. The optimized models achieve performance competitive with the already high-performing dynamic models while reducing the number of layers by 33%. In real-world applications, the technique translates into an average of 53% higher throughput, reduced power consumption, or lower computing cost without loss of accuracy.
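For illustration, the following is a minimal sketch (not the thesis implementation) of how an RL policy can gate a ViT's layers on a per-input basis. The SkipPolicy module, the block interface, and the Bernoulli sampling scheme are illustrative assumptions.

```python
# Minimal sketch of an input-conditioned layer-skipping ViT forward pass.
import torch
import torch.nn as nn


class SkipPolicy(nn.Module):
    """Tiny policy head: reads the [CLS] token and emits one
    execute/skip probability per Transformer block."""

    def __init__(self, dim: int, num_layers: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // 2),
            nn.GELU(),
            nn.Linear(dim // 2, num_layers),
        )

    def forward(self, cls_token: torch.Tensor) -> torch.Tensor:
        # cls_token: (batch, dim) -> (batch, num_layers) probabilities
        return torch.sigmoid(self.net(cls_token))


class LayerSkippingViT(nn.Module):
    """Wraps a stack of Transformer blocks and skips them per input."""

    def __init__(self, blocks: nn.ModuleList, dim: int):
        super().__init__()
        self.blocks = blocks
        self.policy = SkipPolicy(dim, len(blocks))

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, seq_len, dim); tokens[:, 0] is the [CLS] token.
        probs = self.policy(tokens[:, 0])        # per-layer execute probabilities
        actions = torch.bernoulli(probs)         # sampled skip decisions (RL actions)
        for i, block in enumerate(self.blocks):
            keep = actions[:, i].view(-1, 1, 1)  # 1 = run block i, 0 = skip it
            # For clarity the block is evaluated for every sample; a real
            # implementation bypasses it entirely when it is skipped.
            tokens = keep * block(tokens) + (1 - keep) * tokens
        return tokens, actions, probs            # actions/probs feed the RL reward
```

A reward combining task accuracy with the number of executed layers would then update SkipPolicy with a policy-gradient method; the exact reward formulation used in the thesis is not reproduced here.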
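The post-training NSGA-II stage can likewise be pictured as a multi-objective search over fixed binary skip masks. The sketch below shows one way to score a candidate mask on two objectives (error and fraction of layers executed); the helper names model.embed, model.blocks, and model.head are assumptions, and the thesis's coupling of the fitness to the RL reward is only noted in a comment.

```python
# Minimal sketch of evaluating one NSGA-II candidate: a fixed 0/1 skip mask
# over the ViT's blocks. An off-the-shelf NSGA-II implementation would
# minimise the two objectives returned below.
import torch


@torch.no_grad()
def skip_mask_objectives(model, mask, val_loader):
    """Return (error, layer_cost) for one candidate skip mask."""
    correct, total = 0, 0
    for images, labels in val_loader:
        tokens = model.embed(images)                 # patch + position embedding (assumed helper)
        for keep, block in zip(mask, model.blocks):  # mask[i] == 1 -> execute block i
            if keep:
                tokens = block(tokens)
        logits = model.head(tokens[:, 0])            # classify from the [CLS] token
        correct += (logits.argmax(-1) == labels).sum().item()
        total += labels.numel()
    error = 1.0 - correct / total
    layer_cost = sum(mask) / len(mask)               # fraction of layers executed
    # The thesis additionally makes the GA aware of the skipping dynamics
    # through the RL reward; that term is omitted in this sketch.
    return error, layer_cost
```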
Keywords
Reinforcement Learning Based Layer Skipping Vision Transformer, Efficient Inference