Sıddık AÇIL

Implementing Simple PCA using NumPy

I am open to job offers; feel free to contact me about any vacancies abroad.

In this article, I will implement the PCA algorithm from scratch using Python's NumPy. To verify my results, I compare them against scikit-learn's PCA implementation.


from sklearn.decomposition import PCA
import numpy as np
k = 1 # target dimension(s)
pca = PCA(k) # Create a new PCA instance

data = np.array([[0.5, 1], [0, 0]]) # 2x2 data matrix
print("Data: ", data)
print("Reduced: ", pca.fit_transform(data)) # fit and transform

This results in:


[[-0.55901699]
 [ 0.55901699]]

Centering Data Points

Move the origin to the centroid of our data by subtracting the per-feature mean from every data point.


data = data - data.mean(axis=0) # Center data points
print("Centered Matrix: ", data)

Get Covariance Matrix

Compute the covariance matrix of our features.


cov = np.cov(data.T) # Covariance matrix of the features (np.cov already normalizes, so no extra division is needed)
print("Covariance matrix: ", cov)


Perform Eigendecomposition on Covariance Matrix

Eigendecomposition extracts the eigenvalues and the corresponding eigenvectors of a square matrix.


v, w = np.linalg.eig(cov) # v: eigenvalues, w: eigenvectors (one per column)
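
As a quick check (a small sketch using the names above), every eigenpair should satisfy cov.dot(w[:, i]) == v[i] * w[:, i]:


print("Eigenpairs check out: ", np.allclose(cov.dot(w), w * v)) # expected: True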

Sort Eigenvectors According to Eigenvalues

Most numerical libraries return eigenvectors pre-sorted by eigenvalue; however, NumPy does not. Therefore, we argsort the eigenvalue vector to get the sorting indices and use them to reorder the columns of the eigenvector matrix.


idx = v.argsort()[::-1] # Sort eigenvalues in descending order and get the sorted indices
v = v[idx] # Reorder the eigenvalue vector
w = w[:,idx] # Reorder the eigenvector columns accordingly

print("Eigenvalues: ", v)
print("Eigenvectors: ", w)

Get First K Eigenvectors

Our aim in PCA is to construct a new feature space. The eigenvectors are the axes of this new feature space, and each eigenvalue denotes the magnitude of variance along its axis. In other words, a higher eigenvalue means more variance along the corresponding principal axis. The axes with the highest variances are therefore the most important features of the new space, since they hold most of the information. By taking the first K columns of the eigenvector matrix, the ones with the K largest eigenvalues, we form what is called a projection matrix. The dot product of our data matrix with this projection matrix, which sounds fancier than it is, gives the reduced feature space: the result of PCA.


print("Sonuc: ", data.dot(w[:, :k]))
# Get the dot product of w with the first K columns of eigenvector matrix
# (a.k.a) projection matrix

This matches the scikit-learn output:


[[-0.55901699]
 [ 0.55901699]]
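
Putting the steps together, here is a minimal end-to-end sketch (the helper name pca_numpy is my own, hypothetical choice) compared against scikit-learn:


def pca_numpy(X, k):
    X = X - X.mean(axis=0)      # center the data points
    cov = np.cov(X.T)           # covariance matrix of the features
    v, w = np.linalg.eig(cov)   # eigenvalues and eigenvectors
    idx = v.argsort()[::-1]     # indices that sort eigenvalues in descending order
    w = w[:, idx]               # reorder the eigenvector columns accordingly
    return X.dot(w[:, :k])      # project onto the first k principal axes

X = np.array([[0.5, 1], [0, 0]])
print("NumPy PCA: ", pca_numpy(X, k))
print("scikit-learn PCA: ", PCA(k).fit_transform(X))

Note that the sign of each principal component is arbitrary, so the two outputs may differ by a factor of -1 while still describing the same axis.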

This is it. Any corrections or suggestions are welcome.

Extras

Square vs. Non-Square: Eigendecomposition and SVD

Original Turkish article at

My go-to article on PCA
