DEV Community

Principal Component Analysis (PCA)

Abzal Seitkaziyev ・2 min read

In the previous post, I discussed applying Principal Component Analysis (PCA) to reduce the feature space while capturing as much variance as possible. So why is variance a big deal? A feature with little or no variation barely changes from one observation to the next, so it gives us little predictive information.

Here, I wanted to experiment with covariance and correlation matrices and try to derive principal components using linear algebra. I will use the London bike sharing dataset from my previous blog.

# `data` is the bike sharing DataFrame from the previous post
# use the first 4 features
features = data.iloc[:,:4]
features.sample(5)

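So, the idea is to find the eigenvalues and eigenvectors of the correlation matrix. The eigenvectors give the directions of the principal components, and each eigenvalue, divided by the sum of all eigenvalues, gives the explained variance ratio of its component.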

Let’s start with the correlation matrix. We can compute it from either the original or the standardized features: correlation is unaffected by shifting and rescaling each feature, so both give the same result.

import numpy as np
from sklearn.preprocessing import StandardScaler

# standardize features
features_scaled = StandardScaler().fit_transform(features)

# get correlation matrix on scaled features
corr_matrix = np.corrcoef(features_scaled.T)

# correlation matrix on unscaled data gives the same result
# features.corr()

# eigenvalues and eigenvectors of the correlation matrix
e_vals, e_vects = np.linalg.eig(corr_matrix)

# normalize eigenvalues to get the explained variance ratio of each PC,
# using np.sort(e_vals)[::-1] to sort in descending order
print([x/e_vals.sum() for x in np.sort(e_vals)[::-1]])
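
As a quick sanity check of the claim that scaling does not change the correlation matrix, here is a minimal sketch on synthetic data. The random array `X` is a stand-in I made up for illustration, not the bike sharing dataset. The sketch also uses `np.linalg.eigh`, which is meant for symmetric matrices and returns real eigenvalues in ascending order, whereas the snippet above uses `np.linalg.eig`.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-in for the 4 features (assumption: not the real dataset)
X = rng.normal(size=(200, 4))
X[:, 1] += 0.8 * X[:, 0]        # make two columns correlated
X[:, 3] = 50 + 10 * X[:, 3]     # give one column a different scale and offset

# standardize by hand: zero mean and unit variance per column
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# correlation matrices of raw and standardized data should agree
corr_raw = np.corrcoef(X.T)
corr_scaled = np.corrcoef(X_scaled.T)
same_corr = np.allclose(corr_raw, corr_scaled)

# eigh returns ascending eigenvalues, so reverse them
e_vals = np.linalg.eigh(corr_raw)[0][::-1]
ratios = e_vals / e_vals.sum()

print(same_corr)
print(ratios)
```

Because the diagonal of a correlation matrix is all ones, the eigenvalues add up to the number of features, which is why dividing by their sum gives ratios that add up to 1.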


We could do the same using the covariance matrix, but it takes a few extra matrix manipulations before we get the eigenvalues. You can find the code here.
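
The linked code is not reproduced in this post, so the following is only one possible version of those extra steps, on synthetic stand-in data (the array `X` and its column scales are made up). It divides the covariance matrix by the outer product of the feature standard deviations to recover the correlation matrix, and also checks that the covariance matrix of standardized features is already the correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic stand-in data with very different column scales (assumption)
X = rng.normal(size=(300, 4)) * np.array([1.0, 5.0, 20.0, 0.5])
X[:, 2] += 3.0 * X[:, 1]

# route 1: rescale the covariance matrix into a correlation matrix
cov = np.cov(X.T)                        # sample covariance (ddof=1)
std = np.sqrt(np.diag(cov))              # per-feature standard deviations
corr_from_cov = cov / np.outer(std, std)

# route 2: covariance of standardized features
# (population std, ddof=0, which is also what StandardScaler uses)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
cov_scaled = np.cov(X_scaled.T, ddof=0)

matches_corrcoef = np.allclose(corr_from_cov, np.corrcoef(X.T))
matches_scaled_cov = np.allclose(cov_scaled, np.corrcoef(X.T))
print(matches_corrcoef, matches_scaled_cov)
```

Either route ends at the same matrix, so the eigenvalues and the explained variance ratios come out the same as with `np.corrcoef`.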

Let's check if we get the same result using sklearn’s PCA. Per the sklearn documentation, it uses singular value decomposition (SVD) instead of the eigendecomposition we did above. It centers the input but does not scale it, so we still need to standardize the data ourselves.

# Use sklearn's PCA
from sklearn.decomposition import PCA

pca = PCA()

# fit and transform scaled features
pca.fit_transform(features_scaled)

# get explained variance ratio
print(pca.explained_variance_ratio_)
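
To see why the SVD route and the eigendecomposition route agree, here is a small NumPy-only sketch on synthetic stand-in data (the array `X` is made up). For centered data with n rows, the squared singular values divided by n - 1 are the eigenvalues of the sample covariance matrix, so both routes lead to the same ratios. This illustrates the idea only and is not a copy of sklearn's internals.

```python
import numpy as np

rng = np.random.default_rng(2)

# synthetic stand-in for the standardized features (assumption)
X = rng.normal(size=(250, 4))
X[:, 1] += 0.5 * X[:, 0]
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
n = X_scaled.shape[0]

# route 1: singular values of the standardized data (returned in descending order)
s = np.linalg.svd(X_scaled, compute_uv=False)
var_svd = s**2 / (n - 1)
ratio_svd = var_svd / var_svd.sum()

# route 2: eigenvalues of the correlation matrix (ascending, so reverse)
e_vals = np.linalg.eigvalsh(np.corrcoef(X_scaled.T))[::-1]
ratio_eig = e_vals / e_vals.sum()

print(ratio_svd)
print(ratio_eig)
```

The raw variances from the two routes differ by a constant factor of n / (n - 1), but that factor cancels once each is divided by its sum.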

As we can see, both approaches give the same explained variance ratios. Based on that analysis, we can decide how many principal components to keep and then transform the features into those components, either with a projection matrix built from the eigenvectors or with sklearn.
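
Keeping the first k components by hand could look like the sketch below. The data is a synthetic stand-in and `k = 2` is an arbitrary choice for illustration. The projection matrix `W` holds the eigenvectors of the k largest eigenvalues as its columns. sklearn's `PCA(n_components=k)` should give the same components up to the sign of each column, since an eigenvector is only defined up to sign.

```python
import numpy as np

rng = np.random.default_rng(3)

# synthetic stand-in for the standardized features (assumption)
X = rng.normal(size=(200, 4))
X[:, 1] += 0.9 * X[:, 0]
X[:, 2] -= 0.7 * X[:, 0]
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# eigendecomposition of the correlation matrix, sorted by descending eigenvalue
e_vals, e_vects = np.linalg.eigh(np.corrcoef(X_scaled.T))
order = np.argsort(e_vals)[::-1]
e_vals, e_vects = e_vals[order], e_vects[:, order]

k = 2                      # number of components to keep (arbitrary here)
W = e_vects[:, :k]         # projection matrix, shape (4, k)
X_pca = X_scaled @ W       # projected data, shape (200, k)

# the population variance of each component equals its eigenvalue
comp_var = X_pca.var(axis=0)
print(X_pca.shape)
print(comp_var, e_vals[:k])
```

The projected columns are uncorrelated with each other, and their variances match the eigenvalues used to rank them.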
