Deepfake Video Detection Using Convolutional Vision Transformer


Addis Ababa University


The rapid advancement of deep learning models that can generate and synthesize hyper-realistic videos, known as Deepfakes, and their easy availability to the general public have raised concern about their possible malicious use. Deep learning techniques can now generate faces, swap faces between two subjects in a video, alter facial expressions, change gender, and alter facial features, to list a few. These powerful video manipulation methods have potential uses in many fields. However, they also pose a looming threat to everyone if used for harmful purposes such as identity theft, phishing, and scams. Therefore, it is important to be able to tell whether a specific video is real or manipulated in order to deter and mitigate the risks posed by Deepfakes. In this thesis, we present a system that detects whether a specific video is real or a Deepfake. The proposed system has two components: preprocessing and detection. The preprocessing component prepares the video dataset for the detection stage: the face region is extracted in 224 x 224 RGB format, and data augmentation is applied to enlarge the dataset and improve the accuracy of the model. For the detection component, we use a Convolutional Neural Network (CNN) and a Vision Transformer (ViT). The CNN consists only of convolutional operations (without a fully connected layer), and its purpose is to extract learnable features. The ViT takes the learned features as input and further encodes them for the final detection. The proposed system is implemented in PyTorch, an open-source machine learning library. The DeepFake Detection Challenge (DFDC) dataset was used to train, validate, and test the model. The DFDC dataset contains 119,154 videos created using publicly available deep learning models for video generation. Our model was trained on 162,174 face images extracted from the video dataset. Ninety percent of the face images are augmented during training and validation. 
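As an illustration of the detection component described above, the following is a minimal PyTorch sketch of a convolution-only feature extractor feeding a Transformer encoder. It is not the thesis implementation: the class names `ConvFeatureExtractor` and `ConvViT`, the channel widths, the embedding size, the encoder depth, the number of heads, and the use of a class token with learned positional embeddings are all assumptions chosen to keep the example small.

```python
import torch
import torch.nn as nn


class ConvFeatureExtractor(nn.Module):
    """Convolution-only feature extractor (no fully connected layer).

    Channel widths are illustrative assumptions, not values from the thesis.
    """

    def __init__(self, channels=(3, 16, 32, 64, 128, 256)):
        super().__init__()
        blocks = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            blocks += [
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=2),  # halves height and width
            ]
        self.features = nn.Sequential(*blocks)

    def forward(self, x):
        return self.features(x)  # (B, C, H, W) feature map


class ConvViT(nn.Module):
    """Hypothetical CNN + Transformer encoder classifier (real vs. Deepfake)."""

    def __init__(self, image_size=224, channels=(3, 16, 32, 64, 128, 256),
                 embed_dim=128, depth=2, num_heads=4, num_classes=2):
        super().__init__()
        self.cnn = ConvFeatureExtractor(channels)
        side = image_size // 2 ** (len(channels) - 1)  # 224 -> 7 after 5 poolings
        num_tokens = side * side
        # Linear projection of each spatial feature vector to the token size.
        self.proj = nn.Linear(channels[-1], embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=4 * embed_dim, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        feats = self.cnn(x)                          # (B, C, H, W)
        tokens = feats.flatten(2).transpose(1, 2)    # (B, H*W, C)
        tokens = self.proj(tokens)                   # (B, H*W, embed_dim)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        encoded = self.encoder(tokens)
        return self.head(encoded[:, 0])              # logits from the class token


# Usage: two random 224 x 224 RGB "face crops" stand in for real data.
model = ConvViT().eval()
with torch.no_grad():
    feature_map = model.cnn(torch.randn(2, 3, 224, 224))
    logits = model(torch.randn(2, 3, 224, 224))
print(feature_map.shape, logits.shape)
```

Treating each spatial position of the CNN feature map as one token is only one way to connect the two modules; the sketch uses it because it keeps the token count small (49 tokens for a 224 x 224 input under the assumed five pooling stages).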
We tested the model on 400 unseen videos and achieved 91.5 percent accuracy, an AUC value of 0.91, and a loss value of 0.32. Our contribution is the addition of a CNN module to the ViT architecture, which achieves a competitive result on the DFDC dataset.
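For readers unfamiliar with the two reported metrics, the sketch below shows one common way to compute accuracy and AUC from per-video scores in plain Python. The helper names `accuracy` and `roc_auc`, the 0.5 decision threshold, the label convention (1 = Deepfake, 0 = real), and the toy scores are all assumptions for illustration; they are not the thesis evaluation code or its data.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (made-up scores, assumed convention: 1 = Deepfake, 0 = real).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
acc = accuracy(labels, scores)
auc = roc_auc(labels, scores)
print(f"accuracy={acc:.3f}, auc={auc:.3f}")
```

Accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the two numbers are usually reported together.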



Deep Learning, Deepfakes, Deepfakes Video Detection, CNN, Transformer, Vision Transformer, Convolutional Vision Transformer, GAN