Dimensionality Reduction and Classification Using Improved Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA)

Addis Ababa University


Principal component analysis (PCA) and linear discriminant analysis (LDA) are two popular methods for dimensionality reduction. PCA is a multivariate data analysis method that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of linearly uncorrelated components called principal components. LDA finds a linear combination of variables that separates two or more classes of objects by identifying a low-dimensional subspace in which data points from different classes are kept far apart and those from the same class are kept as close together as possible. In this study, dimensionality reduction and classification were performed using improved PCA and LDA to identify the most important discriminant variables (chlorogenic acids, CGAs) in the phenolic compound content dataset of green coffee beans, for the purpose of identifying their geographical origin. The dataset used in this work was generated from a published article (Mehari, B. et al., 2016) by applying the Box-Muller method to the mean and standard deviation values reported for the green coffee beans in each regional and sub-regional category. Prior to constructing the PCA model, the suitability of the dataset was assessed using the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy and Bartlett’s test of sphericity. Subsequently, the dataset was subjected to PCA with Varimax rotation to identify the most discriminating compounds in the green coffee beans, and an LDA model was developed to classify the coffee beans. The findings showed that 3-caffeoylquinic acid (3-CQA), 4,5-dicaffeoylquinic acid (4,5-diCQA), the 3,5-dicaffeoylquinic acid (3,5-diCQA) to 4,5-diCQA concentration ratio, and the 4,5-diCQA to 3,4-dicaffeoylquinic acid (3,4-diCQA) concentration ratio were the most discriminating variables for the authentication of green coffee beans from the various regions.
Among these, 3-CQA and 4,5-diCQA were selected as suitable discriminant marker compounds for green coffee beans originating from the Northwest (Benishangul and Finoteselam) and East (Harar) regions studied, respectively, at both the regional and sub-regional levels. Moreover, at the sub-regional level, coffee bean samples from Jimma A, Wollega, and Sidama SA were distinguished by the 3,5-diCQA to 4,5-diCQA concentration ratio, while the 4,5-diCQA to 3,4-diCQA concentration ratio was found appropriate for differentiating coffee beans from Yirgachefe and Jimma-B from the other coffee varieties. The LDA results were in line with the PCA results, indicating that the LDA model was able to classify almost all of the coffee beans accurately based on their geographical origin. The recognition and prediction abilities of the LDA model were 94% and 92.4%, respectively, at the regional level and 94.3% and 93.3%, respectively, at the sub-regional level; hence, the best discrimination of green coffee beans was achieved at both the regional and sub-regional levels. Furthermore, comparisons between the results obtained in this work and those reported in the literature demonstrate the superiority of the improved methods.
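
The data-generation step described above, drawing normally distributed values from a reported mean and standard deviation with the Box-Muller method, can be illustrated with a minimal Python sketch. This is not the code used in the study: the function name box_muller, the seed, the sample size, and the mean and standard deviation values are all hypothetical and chosen only for illustration.

```python
import math
import random


def box_muller(mean, sd, n, rng):
    """Draw n normal variates with the given mean and standard deviation."""
    samples = []
    while len(samples) < n:
        # u1 is kept in (0, 1] so that log(u1) is always finite.
        u1 = 1.0 - rng.random()
        u2 = rng.random()
        radius = math.sqrt(-2.0 * math.log(u1))
        angle = 2.0 * math.pi * u2
        # Each pair of uniforms yields two independent standard normals.
        for z in (radius * math.cos(angle), radius * math.sin(angle)):
            if len(samples) < n:
                # Rescale the standard normal to the target mean and SD.
                samples.append(mean + sd * z)
    return samples


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical summary statistics, NOT values from the cited article.
simulated = box_muller(mean=10.0, sd=2.0, n=5000, rng=rng)

sample_mean = sum(simulated) / len(simulated)
sample_sd = math.sqrt(
    sum((x - sample_mean) ** 2 for x in simulated) / (len(simulated) - 1)
)
print(round(sample_mean, 2), round(sample_sd, 2))
```

With a large enough sample, the printed mean and standard deviation should lie close to the values passed in. In a setting like the one described, such a routine would presumably be applied once per compound and per regional or sub-regional category.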



Eigenvalue, Eigenvector, Dimensionality Reduction, Classification, Principal Component Analysis, Linear Discriminant Analysis, Coffee Beans